ABSTRACT

We describe several polygon compression techniques to enable efficient transmission of polygons representing geographical targets. The main application is to embed compressed polygons in emergency alert messages that have strict length restrictions, as in the case of Wireless Emergency Alert messages. We are able to compress polygons to between 9.7% and 23.6% of original length, depending on characteristics of the specific polygons, reducing original polygon lengths from 43-331 characters to 8-55 characters. The best techniques apply several heuristics to perform initial compression, and then other algorithmic techniques, including higher base encoding. Further, these methods respect the computation and storage constraints typical of cell phones. Two of the best techniques are a “bignum” quadratic combination of integer coordinates and a variable length encoding that takes advantage of a strongly skewed polygon coordinate distribution. Both techniques, applied to one of two “delta” representations of polygons, are on average able to reduce the size of polygons by some 80%. A repeated substring dictionary can provide further compression, and a merger of these techniques into a “polyalgorithm” can also provide additional improvements.

1. INTRODUCTION 

Geo-targeting is widely used on the Internet to better target users with advertisements and multimedia content, and essentially to improve the user experience. Such information helps in marketing brands and increasing user engagement. Scenarios also exist where geo-targeting at a given time becomes imperative for specific sets of people for information exchange, thereby contributing to problems of network congestion and effectiveness. Emergency scenarios are a quintessential example, where people in the affected area need to be informed and guided throughout the duration of an emergency. To address this, Wireless Emergency Alerts (WEA) is a nation-wide system for broadcasting short messages^1 (currently 90 characters, similar to SMS messages) to all phones in a designated geographic area via activation of appropriate cell towers. The area is typically identified by a polygon, though currently many operators use rather coarse-grained targeting (such as to a whole county).

Our research group has developed and evaluated improved geo-targeting technology. For testing purposes, we ran trials of the new system, which we call WEA+. We are using SMS (and WiFi) to simulate cell broadcast, and also have on campus an experimental cell system, CROSSMobile [10], that supports true cell broadcast on an unused GSM frequency to cell phones with a special SIM card.

In order to do this, we included a compressed polygon representation as part of the short message text. This is expected to be feasible in the current 90-character WEA messages, and even more effective in future implementations of WEA that allow longer messages or use multiple messages. We have available to us a corpus of 11,370 WEA messages sent out by the National Weather Service (NWS) [13]^2. The polygons in the NWS corpus range from 4-24 points, with a size ranging from 43-331 characters. Since WEA messages are broadcast to thousands of people, and it is believed that adding a polygon to more precisely define the target area is critical, it is essential to be able to compress typical WEA polygons to fit within the current or anticipated future WEA message length, leaving room for meaningful text as well. In this paper, we explore several ways of substantially compressing such polygons using heuristics and standard algorithms. Our techniques provide better compression for almost all polygons in the corpus than standard algorithms.

The compression problem we are tackling here is quite different from that described in most other published research on polygon compression [1,6]. That work typically deals with a large number of inter-connected polygons in a 2D or 3D representation of a surface or solid, and thus compresses a large number of polygons at the same time. Many of these polygons share common points and edges, which can be exploited in the compression; in our case, we have a single, relatively small polygon to compress, and so cannot amortize items such as a dictionary of common points.

In designing our techniques, we have looked at the typical distribution of coordinate values, polygon sizes and character string length of original uncompressed polygons, shown in Figures 2a and 2b. Many of these characteristics have motivated our transformations of the original polygon.

^2 These messages were sent out by the NWS in 2012, 2013, and through December 2014.

Figure 1: A map showing 3 polygons (yellow border). The NWS dataset has both convex and non-convex polygons.

In the NWS corpus the observed range of GPS coordinates, covering a significant portion of the USA, is:

(X,Y) = (17.67,-159.32) to (48.84,-64.56) 

We use three types of transformations on the numeric strings representing a polygon. The first group exploits heuristics, redundancies and patterns in the GPS coordinates, substantially simplifying and reducing the number of numeric characters in most polygon coordinates. The next set of transformations encodes the now simplified coordinates in a higher base, such as all alpha-numeric characters, and then applies arithmetic and character string operations to further compress each polygon. The final set of transformations uses the statistical nature of the entire set of polygons to further compress some polygons.

We evaluated our corpus of polygons with combinations of the following compression techniques:

1. Purely heuristic, using deltas and fixed or variable length fields

2. Encoding in a higher base

3. Using a repeated substring dictionary to replace the most frequently occurring substrings

4. Arithmetic operations to combine values

5. Arithmetic encoding, an entropy based compression technique

6. Other standard algorithms - LZW [12], 7zip, gzip, Huffman [9] and Golomb [8]


A key constraint is to compress the polygon into a string of characters that are acceptable via SMS or cell broadcast, and specifically to the gateways we use to send an SMS via email for our pilot trials. To compress the set of decimals (or bits) representing GPS coordinates, we use a base B representation, avoiding characters that are questionable. This is discussed further in the practical considerations section. Base B = 62 is a convenient choice since it uses only alphanumeric characters [0-9a-zA-Z]. Using a higher base such as B = 70 or B = 90 uses more characters and will improve the overall compression, and changes the trade-off between techniques. Briefly, our paper explores numerous compression techniques applied to different transformations, or manipulations, of the original polygon.


2. HEURISTIC APPROACHES 

We have combined several heuristics, motivated by analysis and discovered experimentally to work well. The following describes our current techniques. We start with an original N point polygon, given as an ordered finite sequence in $\mathbb{R}^2$:

$$ O = [X_1, Y_1, \ldots, X_N, Y_N] $$

where $X_i$, $Y_i$ are GPS coordinates in decimal degrees.

In the NWS corpus, $N \in [4, 24]$, even though the NWS standard allows polygons of up to 100 points. The original uncompressed polygon length of 43 to 331 characters includes 2N − 1 separating commas, 2N periods and N minus signs. Since $X_i$ is typically dd.dd and $Y_i$ is typically −dd.dd or −ddd.dd, the total length of the original polygon string O is:

$$ len(O) \le \underbrace{4N}_{X_i} + \underbrace{5N}_{Y_i} + \underbrace{2N-1}_{commas} + \underbrace{2N}_{periods} + \underbrace{N}_{minuses} \le 14N - 1 \tag{1} $$

Figure 2: Distribution of NWS data. (a) Distribution of polygon vertices (x-axis: no. of vertices). (b) Distribution of original polygon lengths (x-axis: length in characters).


There are several steps we take to successively compress the polygon. Initially we perform three simplifications and transformations to the original set of coordinates O. The first transformation converts all coordinates to positive integers to yield $O'$. The second finds the minimum x-coordinate and y-coordinate and takes the difference with every vertex ($T^{\Delta min}$). The third transformation considers the difference of consecutive coordinates ($T^{\Delta}$).


Steps for the first transformation, $T^{\Delta min}$:

Step 1: Starting with polygon O, round all numbers to 2 (or 3) decimals precision, convert to integers to drop the decimal point, and switch the sign of $Y_i$ in the USA, so both $X_i$ and $Y_i$ are positive integers, to produce $O'$:

$$ X_i = int(100 * X_i); \quad Y_i = -int(100 * Y_i) $$

Outputting these with 2N − 1 separating commas gives a length of at most 11N − 1 characters, as opposed to the original 14N − 1 (refer to Eq. 1).

Step 2: Compute $X_{min} = \inf_i X_i$, $Y_{min} = \inf_i Y_i$.

Step 3: Compute deltas for all coordinates:

$$ dX_i = X_i - X_{min} $$
$$ dY_i = Y_i - Y_{min} $$

where $dX_i$ and $dY_i$ are non-negative integers.

Step 4: Compute deltas for $X_{min}$ and $Y_{min}$ from a chosen “origin” $(X_0, Y_0)$:

$$ dX_{min} = X_{min} - X_0 $$
$$ dY_{min} = Y_{min} - Y_0 $$

We found (1600,6000) most effective (see Figures 4a and 4b).

Step 5: Since these are closed polygons, drop the last point $(X_N, Y_N)$, which is a duplicate of the first point, producing a shorter set of coordinates:

$$ T^{\Delta min} = [dX_{min}, dY_{min}, dX_1, dY_1, \ldots, dX_{N-1}, dY_{N-1}] $$

^3 Following Mike Gerber of the NWS [7]

Steps for the second transformation, $T^{\Delta}$, are very similar:

Step 1: Round all numbers to 2 (or 3) decimals precision, convert to integers to drop the decimal point, and switch signs for $Y_i$:

$$ X_i = int(100 * X_i); \quad Y_i = -int(100 * Y_i) $$

Step 2: Compute deltas for all coordinates:

$$ \Delta X_{i+1} = X_{i+1} - X_i $$
$$ \Delta Y_{i+1} = Y_{i+1} - Y_i $$

Step 3: Compute deltas for $X_1$ and $Y_1$ from the chosen “origin” $(X_0, Y_0)$:

$$ \delta X_1 = X_1 - X_0 $$
$$ \delta Y_1 = Y_1 - Y_0 $$

Here again we used (1600,6000) as “origin”.

Step 4: Many of the $\Delta$s are negative integers, which causes problems for the compression techniques discussed below. Therefore, every $\Delta X_i$ or $\Delta Y_i$ element e will be converted as follows:

$$ e \mapsto \begin{cases} 2e, & \text{if } e \ge 0 \\ -2e - 1, & \text{if } e < 0 \end{cases} $$

Step 5: Drop the last point $(X_N, Y_N)$, which is a duplicate of the first point, producing a shorter set of coordinates:

$$ T^{\Delta} = [\delta X_1, \delta Y_1, \Delta X_2, \Delta Y_2, \ldots, \Delta X_{N-1}, \Delta Y_{N-1}] $$

Figure 3: Positive skewness in the deltas $dX_i$ (3a) and $dY_i$ (3b). Heavy tails and peakedness define positive kurtosis [3]. Kurtosis = 85.25 and 139.76 for $dX_i$ and $dY_i$ respectively if the 14354 cases where $dX_i = 0$ and the 13838 cases where $dY_i = 0$ are included. Kurtosis is not very high with inclusion of $\Delta X_i = 0$ and $\Delta Y_i = 0$ since their distributions have uniform fall off. Panels: (a) $dX_i \in [1, 327]$, skewness = 1.63, kurtosis = 1.42; (b) $dY_i \in [1, 325]$, skewness = 2.08, kurtosis = 5.34; (c) $\Delta X_i \in [1, 361]$, skewness = 4.34, kurtosis = 21.17; (d) $\Delta Y_i \in [1, 524]$, skewness = 5.2, kurtosis = 31.56.


Note that in addition to dropping the last point, we also save an additional point, since we start from $\delta X_1$ and $\delta Y_1$ rather than the additional $X_{min}$, $Y_{min}$ in the $T^{\Delta min}$ transformation. This form is particularly interesting, since the distribution of its $\Delta X_i$ has a skewed but less peaked shape compared to that of the $dX_i$ in $T^{\Delta min}$, but with a longer tail (see Figure 3).
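The two delta transformations described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the polygon is assumed to be a closed list of (X, Y) pairs located in the USA, and `round` is used where Step 1 writes int() so that floating-point error in `100 * x` does not truncate a value such as 9754.999... to 9754.

```python
ORIGIN = (1600, 6000)  # the "origin" (X_0, Y_0) reported in the text


def to_int_coords(polygon, scale=100):
    """Step 1: scale to integers and flip the sign of Y so both are positive."""
    return [(round(x * scale), -round(y * scale)) for x, y in polygon]


def zigzag(e):
    """Map a signed delta to a non-negative integer (Step 4 of T^Delta)."""
    return 2 * e if e >= 0 else -2 * e - 1


def delta_min(polygon, origin=ORIGIN):
    """T^{Delta min}: offsets of every vertex from (X_min, Y_min)."""
    pts = to_int_coords(polygon)[:-1]  # drop the duplicated closing point
    x_min = min(x for x, _ in pts)
    y_min = min(y for _, y in pts)
    out = [x_min - origin[0], y_min - origin[1]]
    for x, y in pts:
        out += [x - x_min, y - y_min]
    return out


def delta_consecutive(polygon, origin=ORIGIN):
    """T^Delta: first vertex relative to the origin, then consecutive deltas."""
    pts = to_int_coords(polygon)[:-1]  # drop the duplicated closing point
    out = [pts[0][0] - origin[0], pts[0][1] - origin[1]]
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        out += [zigzag(x1 - x0), zigzag(y1 - y0)]
    return out


# The example polygon from Table 1 (N = 5, counting the closing point).
example = [(31.3, -97.4), (31.51, -97.55), (31.8, -96.99),
           (31.58, -96.84), (31.3, -97.4)]
print(delta_min(example))
print(delta_consecutive(example))
```

For the Table 1 polygon, the two calls produce the integer lists shown in the $T^{\Delta min}$ and $T^{\Delta}$ rows of that table.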


These heuristics already produce a substantial compression because of the limited ranges and skewed distribution of the delta polygon coordinates ($dX_i$, $dY_i$) and ($\Delta X_i$, $\Delta Y_i$), and of the starting points ($dX_{min}$, $dY_{min}$) and ($\delta X_1$, $\delta Y_1$), shown in Figures 3 and 4. While the various compression techniques work well because of these range and skew characteristics, many of the techniques are not strongly dependent on the specifics, and thus many would work well even for a somewhat different set of polygons.

It is important to note that in both forms of the delta transformations T we have two different sets of integers to compress, with distinctly different ranges and distributions: the single starting point pair of $dX_{min}$ and $dY_{min}$ for $T^{\Delta min}$ (or $\delta X_1$ and $\delta Y_1$ for $T^{\Delta}$) and the N − 1 pairs $dX_i$, $dY_i$ for $T^{\Delta min}$ (or N − 2 pairs $\Delta X_i$ and $\Delta Y_i$ for $T^{\Delta}$). We will thus treat them separately to get the best results.

As we shall see, for the NWS corpus two of the tech¬ 
niques are best, but we describe several others and the 
sub-transformations since some of these might perform 
better for other sets of polygons. 


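As a very first step to compress the polygon, $T^{\Delta min}$ or $T^{\Delta}$ could be directly encoded as a comma-delimited string:

$$ T_c^{\Delta min} = dX_{min} \bullet \text{“,”} \bullet dY_{min} \bullet \text{“,”} \bullet dX_1 \bullet \text{“,”} \bullet dY_1 \bullet \ldots \bullet dX_{N-1} \bullet \text{“,”} \bullet dY_{N-1} \tag{2} $$

$$ T_c^{\Delta} = \delta X_1 \bullet \text{“,”} \bullet \delta Y_1 \bullet \text{“,”} \bullet \Delta X_2 \bullet \text{“,”} \bullet \Delta Y_2 \bullet \ldots \bullet \Delta X_{N-1} \bullet \text{“,”} \bullet \Delta Y_{N-1} \tag{3} $$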


Symbol • is used to denote string concatenation. Figure 5 shows the distribution of lengths using this transformation. Since all deltas $dX_i$, $dY_i$ are less than 350 (and $\Delta X_i$, $\Delta Y_i$ are less than 550), each of these deltas can be encoded in at most three decimal digits, ddd, while $dX_{min}$ or $\delta X_1$ will take at most four digits, dddd, and $dY_{min}$ or $\delta Y_1$ at most five digits, ddddd. The lower bound on the length of $T_c^{\Delta min}$ is thus:


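$$ len(T_c^{\Delta min}) \ge \underbrace{1}_{dX_{min}} + \underbrace{1}_{dY_{min}} + \underbrace{(N-1)}_{dX_i} + \underbrace{(N-1)}_{dY_i} + \underbrace{2N-1}_{commas} \ge 4N - 1 \tag{4} $$

and 4N − 5 for $T_c^{\Delta}$, though it is very rare in the NWS corpus for $dX_{min}$ and $dY_{min}$ to be encodable in a single digit.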

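The upper bound on the length of $T_c^{\Delta min}$ is:

$$ len(T_c^{\Delta min}) \le \underbrace{4}_{dX_{min}} + \underbrace{5}_{dY_{min}} + \underbrace{3(N-1)}_{dX_i} + \underbrace{3(N-1)}_{dY_i} + \underbrace{2N-1}_{commas} \le 8N + 2 \tag{5} $$

and 8N − 6 for $T_c^{\Delta}$.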

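We can eliminate all commas by using fixed three digit fields for each delta, padded with zeros, four digits for $dX_{min}$ and $\delta X_1$, and five digits for $dY_{min}$ and $\delta Y_1$, to get a significant improvement (Figure 5b) over $len(T_c^{\Delta min})$ or $len(T_c^{\Delta})$. We call the result $T_f^{\Delta min}$ or $T_f^{\Delta}$:

$$ len(T_f^{\Delta min}) = \underbrace{4}_{dX_{min}} + \underbrace{5}_{dY_{min}} + \underbrace{3(N-1)}_{dX_i} + \underbrace{3(N-1)}_{dY_i} = 6N + 3 \tag{6} $$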


and 6N − 3 for $T_f^{\Delta}$, which is in between the upper and lower bounds on $T_c^{\Delta min}$ or $T_c^{\Delta}$. The skew in the delta values results in $T_c^{\Delta min}$ being usually better than $T_f^{\Delta min}$.
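As a small worked check of the length bounds in Eqs. 4-6, the comma-delimited and fixed-field strings can be produced with a few lines of Python. The helper names are hypothetical, and the field widths (four, five and three digits) are the ones stated in the text.

```python
def comma_delimited(deltas):
    """Comma-delimited decimal string of a delta list (Eqs. 2 and 3)."""
    return ",".join(str(v) for v in deltas)


def fixed_field(deltas):
    """Fixed-field decimal string: 4 + 5 digits for the start, 3 per delta."""
    head = f"{deltas[0]:04d}{deltas[1]:05d}"
    return head + "".join(f"{v:03d}" for v in deltas[2:])


# T^{Delta min} for the Table 1 polygon, where N = 5.
t_delta_min = [1530, 3684, 0, 56, 21, 71, 50, 15, 28, 0]
n = 5
tc = comma_delimited(t_delta_min)
tf = fixed_field(t_delta_min)
print(tc, len(tc))  # length lies between 4N - 1 and 8N + 2
print(tf, len(tf))  # length is exactly 6N + 3
```

For this polygon the comma-delimited string has 31 characters and the fixed-field string has 33, both of which match the strings listed in Table 1.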


3. HIGHER BASE ENCODING 

For the next set of compression techniques we use a convenient base (B) transformation, $H_B(\cdot)$, to represent the encoded polygon. Encoding a large number in base 62 using alphanumeric characters [0-9A-Za-z] rather than just numeric digits [0-9] significantly reduces overall polygon string lengths. For example, one base 62 “bigit” b can represent an integer up to 61, two base-62 “bigits”, bb, can represent an integer up to 3843, while three “bigits” bbb can represent an integer up to 238327, and so on. So each delta can be represented in one or two base ($B \ge 62$) characters. A higher base, such as base 70, can represent even larger integers in fewer characters: a single base-70 “bigit” b can represent an integer up to 69, two base-70 bigits bb can represent an integer up to 4899, and three base-70 bigits bbb can represent integers up to 342999. Likewise, $dX_{min}$ can be encoded in two base 70 bigits and $dY_{min}$ can be encoded in two or three base 70 bigits. Because of the choice of values for $X_0$ and $Y_0$, the restricted ranges and skew of $dX_{min}$ and $dY_{min}$ allow most to pack within two base-70 characters. For our analysis and experiments of compression techniques, we will use B = 70 with all alphanumeric characters and some allowable special characters used in SMS.

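String $H_{70}(T_c^{\Delta min})$ is the transformation of the comma-delimited values in base-70. For example, if $T^{\Delta min}$ = [9, 60, 70], then $H_{70}(T_c^{\Delta min})$ = “9,y,10”. The upper bound on the length is then:

$$ len(H_{70}(T_c^{\Delta min})) \le \underbrace{2}_{dX_{min}} + \underbrace{3}_{dY_{min}} + \underbrace{2(N-1)}_{dX_i} + \underbrace{2(N-1)}_{dY_i} + \underbrace{2N-1}_{commas} \le 6N \tag{7} $$

and the lower bound will be:

$$ len(H_{70}(T_c^{\Delta min})) \ge \underbrace{1}_{dX_{min}} + \underbrace{1}_{dY_{min}} + \underbrace{(N-1)}_{dX_i} + \underbrace{(N-1)}_{dY_i} + \underbrace{2N-1}_{commas} \ge 4N - 1 \tag{8} $$

Similarly, fixed length coding in base-70, $H_{70}(T_f^{\Delta min})$, will have length strictly 4N + 1:

$$ len(H_{70}(T_f^{\Delta min})) = \underbrace{2}_{dX_{min}} + \underbrace{3}_{dY_{min}} + \underbrace{2(N-1)}_{dX_i} + \underbrace{2(N-1)}_{dY_i} = 4N + 1 \tag{9} $$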
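The base-B transformation $H_B(\cdot)$ can be illustrated with a short Python sketch. The digit order 0-9, A-Z, a-z is inferred from the “9,y,10” example above; the eight extra characters that extend the alphabet to base 70 are a hypothetical choice made only for this sketch, since the text does not list them.

```python
import string

# 62 alphanumeric "bigits" in the order implied by the "9,y,10" example.
ALPHANUM = string.digits + string.ascii_uppercase + string.ascii_lowercase
# Hypothetical special characters (assumption) extending the alphabet to 70.
ALPHABET_70 = ALPHANUM + "*/()%=!?"


def to_base(n, alphabet=ALPHABET_70):
    """Encode a non-negative integer using len(alphabet) as the base."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    base = len(alphabet)
    digits = []
    while True:
        n, r = divmod(n, base)
        digits.append(alphabet[r])
        if n == 0:
            break
    return "".join(reversed(digits))


def from_base(s, alphabet=ALPHABET_70):
    """Decode a string produced by to_base with the same alphabet."""
    n = 0
    for ch in s:
        n = n * len(alphabet) + alphabet.index(ch)
    return n


print(",".join(to_base(v) for v in [9, 60, 70]))
print(len(to_base(4899)), len(to_base(4900)))
```

The second print shows the two-bigit limit of base 70: 4899 still fits in two characters, while 4900 needs three.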


4. VARIABLE LENGTH ENCODING 

Using the skewed distribution of the delta lengths (Figure 3), we can significantly improve compression compared to $H_{70}(T_f^{\Delta min})$, while still omitting the commas that appear in $H_{70}(T_c^{\Delta min})$. Similar to the concept of Golomb encoding [8], a simple variable-length encoding of the deltas can use a single base B bigit b for most deltas, those below B, and three characters −bb for the rest: for deltas of B or more, we use the indicator character “-” (minus) followed by two characters. Note that reserving “-” for this purpose means there is one less character for the base encoding.

Figure 4: Ranges for large values in both transformations. $T^{\Delta min}$ has ($dX_{min}$, $dY_{min}$), and $T^{\Delta}$ has ($\delta X_1$, $\delta Y_1$). Panels: (a) $dX_{min} \in [167, 3284]$; (b) $dY_{min} \in [456, 9932]$; (c) $\delta X_1 \in [169, 3301]$; (d) $\delta Y_1 \in [458, 9988]$.

Figure 5: Distribution of base 70 compressed lengths for both transformations. Panels (a)-(d): distributions of lengths.
Similarly, $dX_{min}$ and $dY_{min}$ could be encoded in one, three or four characters, using b, −bb, or +bbb, with − or + as the indicator. However, better compression occurs if we split $dX_{min}$ and $dY_{min}$, or equivalently $\delta X_1$ and $\delta Y_1$ for $T^{\Delta}$, each into two smaller parts using an agreed factor. Using the base B is a particularly appropriate choice:

$$ dX_{min} = B * dX'_{min} + dX''_{min} \quad \text{or} \quad \delta X_1 = B * \delta X'_1 + \delta X''_1 \tag{10} $$

$$ dY_{min} = B * dY'_{min} + dY''_{min} \quad \text{or} \quad \delta Y_1 = B * \delta Y'_1 + \delta Y''_1 \tag{11} $$

$dX'_{min}$ and $dY'_{min}$ encoding will each require at most three characters due to their distribution shown in Figures 4a and 4b. $dX''_{min}$ and $dY''_{min}$ are guaranteed to be encoded using a single bigit each. If B is 70, $dX'_{min}$ will also be encoded using a single bigit. If $T^{\Delta min}$ = [9, 60, 70, 73], then the base-70 encoding would be “9y-10-13”.

It is important to note that we do not need commas or a fixed field to differentiate between the coordinates during decoding, since the indicator character, −, will suffice. The variable length encoding will be upper bounded by 6N for the $T^{\Delta min}$ transformation, and 6N − 6 for the $T^{\Delta}$ transformation. While the $H_{70}(T_f^{\Delta min})$ length is better than these bounds, the skewed distributions of the NWS corpus ensure that the variable length encoding will be most often better (see Figure 5b).
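The simple indicator-based scheme can be sketched as below. This is an illustration under assumptions rather than the authors' code: the 70-character alphabet reuses a hypothetical set of special characters, “-” is kept outside that alphabet as the indicator (matching the “9y-10-13” example), and only the b and −bb forms are implemented.

```python
import string

# Hypothetical 70-character alphabet (assumption); "-" is not part of it.
ALPHABET = (string.digits + string.ascii_uppercase
            + string.ascii_lowercase + "*/()%=!?")


def var_encode(deltas, alphabet=ALPHABET, indicator="-"):
    """One bigit for values below B, indicator plus two bigits otherwise."""
    base = len(alphabet)
    out = []
    for v in deltas:
        if 0 <= v < base:
            out.append(alphabet[v])
        elif v < base * base:
            hi, lo = divmod(v, base)
            out.append(indicator + alphabet[hi] + alphabet[lo])
        else:
            raise ValueError(f"{v} does not fit in two bigits")
    return "".join(out)


def var_decode(text, alphabet=ALPHABET, indicator="-"):
    """Inverse of var_encode; the indicator tells the decoder the field width."""
    base = len(alphabet)
    out, i = [], 0
    while i < len(text):
        if text[i] == indicator:
            hi, lo = alphabet.index(text[i + 1]), alphabet.index(text[i + 2])
            out.append(hi * base + lo)
            i += 3
        else:
            out.append(alphabet.index(text[i]))
            i += 1
    return out


encoded = var_encode([9, 60, 70, 73])
print(encoded)
print(var_decode(encoded))
```

Because the indicator never appears as a bigit, the decoder needs neither commas nor fixed field widths, which is the property the paragraph above relies on.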


Leveraging both Skew and Limited Range of Deltas: We can do even better by further exploiting the skewed distributions. To improve on variable length encoding, we notice that since all deltas $dX_i$, $dY_i$ are less than 350 (and $\Delta X_i$, $\Delta Y_i$ are less than 550), those that are greater than B, and would normally use −bb, do not use the full range allowed by bb; instead we will only see 1b, 2b, ... 6b for $VAR^{\Delta min}$ (and 1b, 2b, ... 8b for $VAR^{\Delta}$), which allows us to replace the three character −bb with a two character xb, where x is one of several unused special characters such as [+*/()%...].^4 Likewise, the split $dX_{min}$ uses one character b for each part. $dY''_{min}$ or $\delta Y''_1$ also uses one character. Only $dY'_{min}$ or $\delta Y'_1$ might use −bb, but their range is also restricted (no more than 167), so instead of −bb, we will also use xb. The upper bound on the length of $T^{\Delta min}$ after applying variable length encoding (VAR) is:

$$ len(VAR_{64}^{\Delta min}) \le \underbrace{2}_{dX'_{min} \,\&\, dX''_{min}} + \underbrace{3}_{dY'_{min} \,\&\, dY''_{min}} + \underbrace{2(N-1)}_{dX_i} + \underbrace{2(N-1)}_{dY_i} \le 4N + 1 \tag{12} $$

Note again that the base 64 reserves an additional six special characters for the allowed 70. Transformation $T^{\Delta}$ needs two more special characters due to the bounds on ($\Delta X_i$, $\Delta Y_i$); $len(VAR_{62}^{\Delta})$ is upper bounded by 4N − 3. The lower bound on the length of $VAR_{64}^{\Delta min}$ is achieved when all values are less than B = 64 (and less than B = 62 for $VAR_{62}^{\Delta}$):

$$ len(VAR_{64}^{\Delta min}) \ge \underbrace{2}_{dX'_{min} \,\&\, dX''_{min}} + \underbrace{2}_{dY'_{min} \,\&\, dY''_{min}} + \underbrace{(N-1)}_{dX_i} + \underbrace{(N-1)}_{dY_i} \ge 2N + 2 \tag{13} $$

For example, $VAR_{64}^{\Delta min}$ = “9y+ll”, if $T^{\Delta min}$ = [9,60,62,65]. The lower bound for $VAR_{62}^{\Delta}$ remains the same as for $VAR_{64}^{\Delta min}$.

^4 One could choose to allocate less than 6 or 8 characters to the xb, but then the “-” would be needed in a few cases.

5. BIGNUM COMPRESSION

Bignum compression further improves on $T^{\Delta min}$ or $T^{\Delta}$ by combining each delta pair ($dX_i$, $dY_i$) or ($\Delta X_i$, $\Delta Y_i$) into a larger single number:

$$ dXY_i = dX_i * XX + dY_i, \tag{14} $$

where XX can be a fixed choice or chosen based on the range of $dY_i$ to make sure there is “space” for $dY_i$. For instance, based on our corpus, XX = 350 will be an appropriate value to “make space” for $dY_i$ because $dY_i$ is less than 350 for the NWS corpus; similarly $\Delta Y_i$ is always less than 550.

Likewise, we can combine ($dX_{min}$, $dY_{min}$) or ($\delta X_1$, $\delta Y_1$) using a larger factor:

$$ dXY_{min} = dX_{min} * Y_{factor} + dY_{min}, \tag{15} $$

where $Y_{factor}$ can be chosen based on the range of $dY_{min}$. An appropriate value based on the corpus is 10,000 to make space for the $dY_{min}$.

Expanding on this pair of deltas idea, we can aggregate all deltas into a single large integer, using a simplified form of arbitrary precision integer arithmetic. The large integer is computed by successive pairing of elements of $T^{\Delta min}$:

$$ BIG_{i+1}^{\Delta min} = (BIG_i^{\Delta min} * XX^2) + ((dX_i + 1) * XX) + (dY_i + 1), \tag{16} $$

where $i \in [1, N-1]$, and $BIG_1^{\Delta min}$ is 0. We add one to all $dX_i$ and $dY_i$ values to avoid the pathological case when delta values are zero.^5 For $T^{\Delta}$ we use essentially the same equation:

$$ BIG_{i+1}^{\Delta} = (BIG_i^{\Delta} * XX^2) + ((\Delta X_i + 1) * XX) + (\Delta Y_i + 1), \tag{17} $$

where $i \in [1, N-2]$. The value XX is chosen for each polygon using an “indicator” character S defined by:

$$ S = \left[ \frac{\max(\sup_i dX_i, \sup_i dY_i)}{d} \right] + 1 \tag{18} $$

$$ XX = d * S + 1 \tag{19} $$

The value for d can be chosen based on the distribution of the deltas. For the NWS corpus, d = 6 ensures XX to be large enough to encode any ($dX_i$, $dY_i$), while d = 9 ensures XX will be large enough for ($\Delta X_i$, $\Delta Y_i$). Also, the above selection of XX guarantees its encoding via S using a single character in base-70 for S.^6

^5 Strictly speaking, only the first value, $dX_1$, could cause a problem.

Larger values for the starting points in $T^{\Delta min}$ and $T^{\Delta}$ could also be included in BIG in several ways. One way is to choose an appropriate set of factors based on the distribution, such as $X_{factor}$ = 3500 and $Y_{factor}$ = 10000 for the NWS corpus, and then apply the following:

$$ BIG_+^{\Delta min} = (BIG_N^{\Delta min} * X_{factor} + dX_{min}) * Y_{factor} + dY_{min} \tag{20} $$

$$ BIG_+^{\Delta} = (BIG_{N-1}^{\Delta} * X_{factor} + \delta X_1) * Y_{factor} + \delta Y_1 \tag{21} $$

The $X_{factor}$ and $Y_{factor}$ make space for their ($dX_{min}$, $dY_{min}$) or ($\delta X_1$, $\delta Y_1$). The encoded string in base-70 representation is a concatenation:

$$ BIG_{70}^{\Delta min} = S \bullet H_{70}(BIG_+^{\Delta min}) $$

$$ BIG_{70}^{\Delta} = S \bullet H_{70}(BIG_+^{\Delta}) $$

An estimate of the bounds on BIG can be obtained by noticing that we are essentially placing each $dX_i$ and $dY_i$, or $\Delta X_i$ and $\Delta Y_i$, into an XX sized space, essentially “shifting” by $\log_2(XX)$ bits, concatenating into a big number, and then chopping into B sized characters (each $\log_2(B)$ bits). Thus, allowing one character for S and about four characters for the $dX_{min}$, $\delta X_1$, etc. shifted by $X_{factor}$ and $Y_{factor}$, we get approximately:

$$ len(BIG_{70}^{\Delta min}) \approx 1 + \frac{\log_2(X_{factor}) + \log_2(Y_{factor})}{\log_2(B)} + \frac{2(N-1)\log_2(XX)}{\log_2(B)} \approx \frac{2(N-1)\log_2(XX)}{\log_2(70)} + 5.09 \tag{22} $$

and

$$ len(BIG_{70}^{\Delta}) \approx 1 + \frac{\log_2(X_{factor}) + \log_2(Y_{factor})}{\log_2(B)} + \frac{2(N-2)\log_2(XX)}{\log_2(B)} \approx \frac{2(N-2)\log_2(XX)}{\log_2(70)} + 5.09 \tag{23} $$

for B = 70, $X_{factor}$ = 3500 and $Y_{factor}$ = 10000.

Thus when XX = B, this is essentially one more than the bound on VAR, and gets increasingly better as XX decreases below B. Furthermore, at larger XX, BIG will be better than VAR for certain polygons when a significant number of the delta coordinates in the polygon are larger than B, thus requiring the longer two-character xb representation. Note that the actual encoding base B is smaller for VAR than for BIG for the same set of available characters.

The best case compression in the BIG technique for a polygon would be to pick the compression parameters adaptively as follows:

$$ XX = \max(\sup_i dX_i, \sup_i dY_i) + 2 \;^7 $$
$$ X_{factor} = dX_{min} + 1 $$
$$ Y_{factor} = dY_{min} + 1 $$

However, outputting these exact choices as part of the string would add too many characters. Indeed, we could then directly encode $dX_{min}$ and $dY_{min}$ in other ways, but experiments suggest that this will not be any better than encoding using the $X_{factor}$ and $Y_{factor}$ approach.^8

^6 A slightly better result is obtained if we use a piecewise linear approximation of XX(S) to $\max(\sup_i dX_i, \sup_i dY_i)$, whereby we exactly match for XX = 0 : 30, and then more granular from XX = 31 : 330 (or 31 : 540 for $T^{\Delta}$).
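The accumulation in Eq. 16 and the starting-point step in Eq. 20 map directly onto Python's arbitrary-precision integers. The sketch below is illustrative only: the function names are hypothetical, XX is passed in by the caller (here chosen adaptively as the maximum delta plus 2, the best-case rule quoted above) instead of being derived from the indicator character S, and the final base-70 conversion is omitted.

```python
X_FACTOR = 3500    # factors reported for the NWS corpus
Y_FACTOR = 10000


def big_pack(t_delta_min, xx):
    """Fold [dX_min, dY_min, dX_1, dY_1, ...] into one integer (Eqs. 16, 20)."""
    dx_min, dy_min, *rest = t_delta_min
    if not (0 <= dx_min < X_FACTOR and 0 <= dy_min < Y_FACTOR):
        raise ValueError("starting point does not fit the chosen factors")
    big = 0
    for dx, dy in zip(rest[0::2], rest[1::2]):
        if max(dx, dy) + 1 >= xx:
            raise ValueError("XX is too small for this polygon")
        # Adding 1 keeps every base-XX "digit" non-zero, so leading zeros
        # cannot vanish and the decoder can stop when the value reaches 0.
        big = big * xx * xx + (dx + 1) * xx + (dy + 1)
    return (big * X_FACTOR + dx_min) * Y_FACTOR + dy_min


def big_unpack(value, xx):
    """Inverse of big_pack for the same XX."""
    value, dy_min = divmod(value, Y_FACTOR)
    big, dx_min = divmod(value, X_FACTOR)
    pairs = []
    while big > 0:
        big, dy1 = divmod(big, xx)
        big, dx1 = divmod(big, xx)
        pairs.append((dx1 - 1, dy1 - 1))
    out = [dx_min, dy_min]
    for dx, dy in reversed(pairs):
        out += [dx, dy]
    return out


t_delta_min = [1530, 3684, 0, 56, 21, 71, 50, 15, 28, 0]
xx = max(t_delta_min[2:]) + 2  # adaptive best-case choice of XX
packed = big_pack(t_delta_min, xx)
print(packed)
print(big_unpack(packed, xx) == t_delta_min)
```

The decoder needs XX to undo the packing, which is why the scheme spends one character (S) on conveying it.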


Table 1: Example of compression with different transformations. Notice that $T^{\Delta}$ is always less in length in comparison to $T^{\Delta min}$, primarily because it has one less point. Base 70 is considered for convenience and easy comparison with VAR.

Transformation      Variable                  Value
$T^{\Delta min}$    $O$                       [31.3,-97.4,31.51,-97.55,31.8,-96.99,31.58,-96.84,31.3,-97.4]
                    $O'$                      [3130,9740,3151,9755,3180,9699,3158,9684]
                    $T^{\Delta min}$          [1530,3684,0,56,21,71,50,15,28,0]
                    $T_c^{\Delta min}$        “1530,3684,0,56,21,71,50,15,28,0”
                    $T_f^{\Delta min}$        “153003684000056021071050015028000”
                    $BIG_+^{\Delta min}$      “1481830715087115303684”
                    $BIG_{70}^{\Delta min}$   “Z7YfAH*‘vmYi4”
$T^{\Delta}$        $O$                       [31.3,-97.4,31.51,-97.55,31.8,-96.99,31.58,-96.84,31.3,-97.4]
                    $O'$                      [3130,9740,3151,9755,3180,9699,3158,9684]
                    $T^{\Delta}$              [1530,3740,42,30,58,111,43,29]
                    $T_c^{\Delta}$            “1530,3740,42,30,58,111,43,29”
                    $T_f^{\Delta}$            “153003740042030058111043029”
                    $BIG_+^{\Delta}$          “33202964332840303740”
                    $BIG_{70}^{\Delta}$       “ZBqu20DM8m*y”

^7 Since we add one to all deltas in Eq. 16, XX has to be strictly greater than all $dX_i + 1$ and $dY_i + 1$ such that we can decode correctly. Strictly, we only have to deal with the first point specially.

^8 We can also apply the same VAR splitting approach, treating the parts of $dX_{min}$ and $dY_{min}$ as four additional deltas, and then use a potentially larger XX' in place of XX in Eq. 16, and also in Eq. 20 in place of $X_{factor}$ and $Y_{factor}$. This adaptive choice of parameters also holds for ($\delta X_1$, $\delta Y_1$). However, in the NWS corpus it is not as good as the $X_{factor}$, $Y_{factor}$ approach.

6. VARIABLE LENGTH ENCODING
WITH REPEATED SUBSTRING DICTIONARY (RSD)

Here we extend the idea of variable length encoding further by use of a dictionary^9. The input string to this technique is the transformed list of coordinates represented by $VAR_B^{\Delta min}$ or $VAR_B^{\Delta}$. As indicated above, each delta will either be a single b or a two character xb.

Inspired by LZW, we exploit the statistical redundancy in polygons across the corpus to generate a static dictionary for the entire NWS corpus, and provide that to the encoding and decoding systems. This dictionary is essentially a set of the most frequently repeated three character sub-strings. We use the same base B as for variable length encoding (VAR) in the dictionary for keys and values^10. A sub-string of size two would not have provided any extra compression, since we need to prefix a dictionary value with an indicator character to distinguish dictionary values from non-dictionary values in the encoding. The size of the dictionary is constrained by the available set of characters, and any encoding using the dictionary will be of length two. We tried two approaches to construct the dictionary:

• Fixed field matching: Chop the character string into disjoint three character substrings, and keep a count of each unique sub-string.

• Sliding window: Slide a three character window across the string and keep a count of each substring in the dictionary. Again, the size is restricted by the base.

The fixed field case is easier to implement, and faster to execute, but the sliding window gives better results.

There may be cases when none of the repeated substrings occur in the input, and no extra compression is achieved, but there is no penalty other than the linear cost of creating and storing the dictionary and of finding any matching substrings. For any variable length encoding using base B, $VAR_B$, the encoding with RSD will be referred to as $VAR_{B-1\_RSD}$; the base is B − 1 due to the extra special character for encoding substrings found in the dictionary.

^9 A dictionary contains a set of mapping objects (key, value).

^10 Actually, we can use the full allocated 70 character size for the table.
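The sliding-window dictionary construction and the two-character substitution can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function names are hypothetical, the corpus in the usage lines is made up, matching is greedy from left to right, and the input strings are assumed never to contain the indicator character “@”.

```python
import string
from collections import Counter

# Characters used as dictionary values (one per key); ordering is an assumption.
VALUE_ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase


def build_dictionary(corpus, size, window=3):
    """Count every 3-character window in the corpus; keep the most frequent."""
    counts = Counter()
    for text in corpus:
        for i in range(len(text) - window + 1):
            counts[text[i:i + window]] += 1
    return [sub for sub, _ in counts.most_common(size)]


def rsd_encode(text, keys, indicator="@"):
    """Replace each dictionary hit (3 chars) by indicator + 1 value char."""
    table = {key: VALUE_ALPHABET[i] for i, key in enumerate(keys)}
    out, i = [], 0
    while i < len(text):
        chunk = text[i:i + 3]
        if chunk in table:
            out.append(indicator + table[chunk])
            i += 3
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def rsd_decode(text, keys, indicator="@"):
    """Expand indicator + value char back into the 3-character key."""
    out, i = [], 0
    while i < len(text):
        if text[i] == indicator:
            out.append(keys[VALUE_ALPHABET.index(text[i + 1])])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


keys = build_dictionary(["a00Ib", "00I00I", "x00I"], size=2)  # toy corpus
short = rsd_encode("Mro4aOS00I4U", keys)
print(keys, short, rsd_decode(short, keys))
```

Each hit saves exactly one character, which is why a two-character key would bring no gain once the indicator is counted.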

Table 2 is an example of a static dictionary storing 70 three character substrings for $VAR^{\Delta min}$ encodings of the NWS corpus. 000 occurred 1414 times, whereas OON was the last entry in the table, which occurred 96 times:

key     base-70 value
000     0
100     1
coo     N
OON     -

Table 2: Repeated Substring Dictionary for variable length encoding


7. ARITHMETIC ENCODING

Arithmetic encoding (AE) is a variable length and lossless encoding technique. For compression and decompression AE relies on a probabilistic model. The algorithm is recursive for each character, i.e., it operates upon and encodes (decodes) one data symbol per iteration [11].

The probability model over the possible characters used to perform the encoding and decoding steps is essential for optimal compression. Specifically, the compression ratio depends on how well the probability model represents the string of characters to be encoded. For our experiments with polygons, the probability of occurrence of any character is based on the entire corpus of polygons.

For the purpose of the polygons, we will define the character sequence $\Sigma$ = (0123456789). Before applying AE, all polygons were transformed to deltas by the same heuristics used to get the delta string.

Arithmetic encoding is applied to this delta string. The basic algorithm is described below.

1: Begin with the current interval $[l_0, u_0)$ initialized to [0, 1).

2: Subdivide the current interval $[l_0, u_0)$ proportional to the probability of each character in $\Sigma$.

3: For each character $c_i$ of the polygon string, we perform two steps: Consider the probability interval for $c_i$, call it $[l_i, u_i)$, and make it the current interval. Subdivide the current interval into subintervals, one for each possible character, and defined by probabilities over $\Sigma$.

4: We output enough bits representing the final interval $[l_n, u_n)$, where n is the length of the polygon string.

Table 3: Example of compression with different transformations. Notice that @ in $VAR_{63\_RSD}^{\Delta min}$ is the indicator character to distinguish between a dictionary value and a non-dictionary value. $VAR_{61\_RSD}^{\Delta}$, however, matched no keys in its RSD dictionary, and therefore $VAR_{61\_RSD}^{\Delta} \approx VAR_{62}^{\Delta}$.

Transformation      Variable                       Value
$T^{\Delta min}$    $O$                            [30.97,-92.28 30.89,-92.04 30.61,-92.22 30.65,-92.34 30.97,-92.28]
                    $O'$                           [3097,9228,3089,9204,3061,9222,3065,9234]
                    $T^{\Delta min}$               [1461,3204,36,24,28,0,0,18,4,30]
                    $VAR_{64}^{\Delta min}$        “Mro4aOS00I4U”
                    $VAR_{63\_RSD}^{\Delta min}$   “NCosaOS@v4U”
$T^{\Delta}$        $O$                            [30.97,-92.28 30.89,-92.04 30.61,-92.22 30.65,-92.34 30.97,-92.28]
                    $O'$                           [3097,9228,3089,9204,3061,9222,3065,9234]
                    $T^{\Delta}$                   [1497,3228,15,47,55,36,8,24]
                    $VAR_{62}^{\Delta}$            “O9q4Flta8O”
                    $VAR_{61\_RSD}^{\Delta}$       “OXquFlta8O”

The output from step 4 of the algorithm is a binary representation of any real value in the interval $[l_n, u_n)$. A real value in the final interval uniquely identifies a string of characters, provided the length of the string to be decoded is known by the decoder. In other words, each input string generates a unique probability interval due to the recursive approach of dividing the probability intervals.

As stated before, the decoder needs the length of the input string to retrieve the original polygon; here we instead embed the number of coordinates of the polygon in the compressed string, which is sufficient for decompression.
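The interval-narrowing steps above can be illustrated with exact rational arithmetic. This is a teaching sketch, not a production coder: the function names are hypothetical, the uniform digit model is an assumption made only for the usage lines (the paper derives probabilities from the corpus), the decoder is given the string length directly, and the final conversion of the interval to a bit string is left out.

```python
from fractions import Fraction


def build_model(probs):
    """Assign each symbol a sub-interval [low, high) of [0, 1)."""
    model, low = {}, Fraction(0)
    for sym, p in probs.items():
        model[sym] = (low, low + p)
        low += p
    if low != 1:
        raise ValueError("probabilities must sum to 1")
    return model


def ae_encode(text, model):
    """Narrow [0, 1) once per character; return the final interval."""
    low, high = Fraction(0), Fraction(1)
    for ch in text:
        span = high - low
        s_low, s_high = model[ch]
        low, high = low + span * s_low, low + span * s_high
    return low, high


def ae_decode(value, length, model):
    """Recover `length` characters from any value inside the final interval."""
    low, high = Fraction(0), Fraction(1)
    out = []
    for _ in range(length):
        span = high - low
        for sym, (s_low, s_high) in model.items():
            if low + span * s_low <= value < low + span * s_high:
                out.append(sym)
                low, high = low + span * s_low, low + span * s_high
                break
    return "".join(out)


# Uniform model over the digit alphabet (assumption, for illustration only).
model = build_model({str(d): Fraction(1, 10) for d in range(10)})
low, high = ae_encode("153003684", model)
print(low, high)
print(ae_decode(low, 9, model))
```

With a skewed model, frequent characters get wider sub-intervals, so the final interval is wider and needs fewer bits to identify, which is where the compression comes from.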

8. STANDARD METHODS-LZW, 
GOLOMB, HUFFMAN, 7ZIP, GZIP 

LZW (Lempel-Ziv-Welch) is a variant of the LZ78 algorithm [15], implemented for example in the well-known Unix compress utility. The basic idea of the LZW algorithm is to take advantage of repetition of substrings in the data [2] and use a shorter encoding for such repetitions, using a data structure like a dictionary with a one-to-one mapping of substrings to encodings. We tried the LZW algorithm
with both and strings. 
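
The dictionary-building idea can be sketched in a few lines. This is a minimal textbook LZW compressor, not the implementation used in the experiments; seeding the dictionary with only the characters that occur in the input is a simplifying assumption.

```python
def lzw_compress(text):
    """Return the list of dictionary codes that LZW emits for `text`."""
    # Seed with one code per distinct character (a simplifying assumption;
    # byte-oriented implementations usually preload all 256 byte values).
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    current, codes = "", []
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate            # keep extending the match
        else:
            codes.append(dictionary[current])
            dictionary[candidate] = len(dictionary)   # learn the new substring
            current = ch
    if current:
        codes.append(dictionary[current])
    return codes

print(lzw_compress("ABABABA"))
```

Short inputs give the dictionary little time to learn repeated substrings, which is consistent with LZW gaining little on polygon strings of a few dozen characters.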

We also tried other standard string compression
algorithms [12], such as 7zip and gzip, but their
compressed lengths were not as good as those of our BIG
or VAR encoding techniques.

We compared our results to Golomb coding [8], a
well-studied technique for input values that follow a
geometric distribution, that is, where most values are
small, as in our delta distributions. A Golomb code is
the concatenation of a variable-length unary-coded prefix
(a string of 1s followed by a 0) and a fixed-bit-length
remainder. We then encoded this bit string in base B
characters for both transformations.
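
As an illustration of the prefix-plus-remainder structure, the sketch below implements the special case of Golomb coding in which the divisor is a power of two (often called Rice coding), so that the remainder always has a fixed bit length k. The value of k is a free parameter here; the setting used in the experiments is not stated in this section.

```python
def rice_encode(value, k):
    """Golomb code with divisor 2**k: unary quotient, then a k-bit remainder."""
    quotient = value >> k
    remainder = value & ((1 << k) - 1)
    prefix = "1" * quotient + "0"                       # unary part
    suffix = format(remainder, "b").zfill(k) if k else ""
    return prefix + suffix

def rice_decode(bits, k):
    """Decode a single codeword produced by rice_encode."""
    quotient = bits.index("0")                          # count of leading 1s
    remainder = int(bits[quotient + 1: quotient + 1 + k], 2) if k else 0
    return (quotient << k) | remainder

print(rice_encode(9, 2))   # quotient 2, remainder 1
```

Small values get short codewords, while a large value pays for a long unary prefix, which is why the code suits strongly skewed delta distributions.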

We also compared our results to Huffman encoding [9],
using probabilities similar to those of the AE method.
Huffman builds a coding tree from these probabilities, but
because of its size (4835 leaf nodes for the Δ_min
transformation and 4715 leaf nodes for the Δ
transformation) this compression technique is not space
efficient on a mobile phone.
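
A minimal sketch of Huffman code construction shows where the storage cost comes from: the decoder must hold one table entry (one leaf) per distinct symbol. The sample delta values below are made up for illustration and are not corpus statistics.

```python
import heapq
from collections import Counter

def huffman_code(frequencies):
    """Build a prefix code {symbol: bit string} from symbol frequencies."""
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}); the unique
    # tie_breaker keeps Python from ever comparing the dictionaries.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {sym: "0" for sym in frequencies}
    counter = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

deltas = [15, 47, 55, 36, 8, 24, 8, 8, 15, 8]   # illustrative values only
code = huffman_code(Counter(deltas))
print(len(code), "leaves;", "code for 8 is", code[8])
```

With several thousand distinct delta values, the same table grows to several thousand entries, which is the space concern raised above.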

9. PRACTICAL CONSIDERATIONS 

In order to embed a compressed polygon string in a
WEA message, or a simulated WEA message for the
trials, we need to signal the start and stop of the polygon
string, or the start and length, as appropriate. In most
cases, we can use a prefix, # or #p, and a postfix, ^ or #].
In order to embed the polygon in longer text messages
that might contain a # character, we encode any other
# as ##. Furthermore, as indicated, we use base 70
(or higher) to compress numeric strings. We can use a
larger base, such as 90; however, we must restrict
ourselves to characters that can be included in an SMS or
broadcast message, and exclude any characters that are
also used by the compression scheme (such as the #
sentinel, the -, + and other characters used as signs for
variable-length VAR substrings, and the @ indicator in
the RSD approach). Thus in most cases the embedded
polygon will be 2-4 characters longer than the numbers
indicated above.
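
The sentinel and escaping convention can be sketched as below, assuming the #p prefix and #] postfix and assuming that the compressed payload itself never contains a # character. The function names are hypothetical, and the payload is the VAR string from Table 3.

```python
PREFIX, POSTFIX = "#p", "#]"    # one of the sentinel pairs mentioned above

def _escape(text):
    """Double every literal '#' so that it cannot be mistaken for a sentinel."""
    return text.replace("#", "##")

def embed(text_before, payload, text_after=""):
    """Wrap the payload in sentinels inside an otherwise ordinary message."""
    return _escape(text_before) + PREFIX + payload + POSTFIX + _escape(text_after)

def extract(message):
    """Return (payload, plain_text). Assumes the payload contains no '#'."""
    text, payload, i = [], None, 0
    while i < len(message):
        if message.startswith("##", i):              # escaped literal '#'
            text.append("#")
            i += 2
        elif payload is None and message.startswith(PREFIX, i):
            end = message.index(POSTFIX, i)
            payload = message[i + len(PREFIX):end]
            i = end + len(POSTFIX)
        else:
            text.append(message[i])
            i += 1
    return payload, "".join(text)

msg = embed("Alert #1: ", "09q4Flta80", "Seek shelter")
print(msg)
print(extract(msg))
```

With this pair the framing costs four characters, the upper end of the 2-4 character overhead quoted above.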

Also, for Arithmetic Encoding (AE), although we need
to include the number of characters to be retrieved by the
decoder, we still need a prefix and a postfix as sentinels.
For some values of S in BIG, we can save one character
by using a different sentinel, #q, #r, or #s, instead of #pS.

Figure 6: Summary of Base 70 results showing mean percentage of compressed length. Our compression techniques
(shown in blue) have a mean compressed length of less than 20% with low error (standard deviation) in comparison to some
standard techniques (shown in green).

See [5] for discussion of character sets. Sometimes, 
operator gateways used in our experiments would not 
transmit some characters, and so we used base 70, even 
though a higher base would somewhat improve the com¬ 
pression percentages. 
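
Base B encoding of a numeric value can be illustrated with the following sketch. The 70-character alphabet is a hypothetical choice (62 alphanumerics plus eight punctuation marks that avoid #, -, + and @); it is not the character set used in the experiments.

```python
import string

# Hypothetical 70-character alphabet: 62 alphanumerics plus 8 punctuation marks.
# It deliberately avoids '#', '-', '+' and '@', which the text reserves.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase + "!$%&()*/"

def to_base(n, alphabet=ALPHABET):
    """Write a non-negative integer using len(alphabet) as the base."""
    base, digits = len(alphabet), []
    while True:
        n, r = divmod(n, base)
        digits.append(alphabet[r])
        if n == 0:
            break
    return "".join(reversed(digits))

def from_base(s, alphabet=ALPHABET):
    """Inverse of to_base."""
    base, n = len(alphabet), 0
    for ch in s:
        n = n * base + alphabet.index(ch)
    return n

# 30.97,-92.28 30.89,-92.04 with signs and punctuation removed
value = 3097922830899204
encoded = to_base(value)
print(len(str(value)), "decimal digits ->", len(encoded), "base-70 characters")
```

Each character dropped from the usable alphabet lowers the base and therefore lengthens the encoded string slightly, which is the trade-off behind the choice of base 70 over a higher base.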

10. POLYALGORITHM 

Because results vary from polygon to polygon, each
technique is the best method for some polygons. This
leads us to consider combinations of techniques and
adaptive technique selection.

As indicated below, the two best techniques, BIG(S)
and VAR_RSD, are close in output character lengths. We
can create a combined technique, POLY(S), that uses
BIG(S) where it is best and VAR_RSD where it is best,
adding an extra character S = 0 to signal VAR_RSD and
other values of S to signal and control BIG(S). Since
BIG and VAR_RSD are so close (with BIG better for
small ΔΔ), adding this extra character can swamp the
benefit. This extra character can be saved in the
practical case by using a different sentinel, #q, instead of
#p0. Note that if we decide that only 70 characters
are available, we use the full B = 70 as the encoding base
for BIG, but since we reserve 9 characters for
indicators in VAR_RSD, we actually use only B = 61 as
the encoding base for the corresponding VAR_RSD. Thus
VAR and VAR_RSD do “waste” some of the available
characters.
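
The selection step of the polyalgorithm can be sketched as follows. The BIG and VAR_RSD encoders themselves are not shown; the function (a hypothetical name) takes their outputs as given, frames each candidate with the sentinel convention described earlier, and keeps the shortest. The postfix is omitted because it is the same for every candidate.

```python
def poly_select(big_candidates, var_rsd_output):
    """Pick the shortest framed string among BIG(S) and VAR_RSD outputs.

    big_candidates: {S: encoded string}, S being the one-character BIG control value.
    var_rsd_output: the VAR_RSD encoding of the same polygon.
    Both are assumed to come from encoders that are not shown here.
    """
    options = ["#p" + s + out for s, out in big_candidates.items()]  # S selects BIG
    options.append("#q" + var_rsd_output)    # dedicated sentinel instead of "#p0"
    return min(options, key=len)             # ties go to the earlier (BIG) option

# Equal payload lengths: the extra S character makes BIG lose.
print(poly_select({"3": "abcdefgh"}, "abcdefgh"))
# BIG shorter by two characters: it wins despite the extra S character.
print(poly_select({"3": "abcdef"}, "abcdefgh"))
```

The first call shows how a single signalling character can swamp the benefit when the two techniques are this close.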

It is important to note that BIG should always be
better than VAR when δXY_max = sup_i(δX_i, δY_i) < B†,
which, as can be seen from Figure 3, occurs more than
65% of the time for B = 62 and more than 80% of the
time for B = 90 for the Δ transformation, and when
ΔXY_max = sup_i(ΔX_i, ΔY_i) < B, which occurs 34% of
the time for B = 62 and 62% of the time for B = 90 for
the Δ_min transformation. Typically the shorter length of
BIG occurs when ΔΔ is quite a bit less than B.

So in setting ΔΔ(S) effectively for the polyalgorithm,
it is important to have ΔΔ close to δXY_max or ΔXY_max.
In many cases BIG will be better than VAR, and it will
be substantially better for smaller ΔΔ and, for larger
ΔΔ, for those polygons where many of the deltas require
two base B characters.

11. COMPARISON, DISCUSSION & CONCLUSIONS

Figure 6 summarizes the 70 character (usually base 70,
except for the VAR techniques) compression results for
the 11370 polygons collected from the NWS online
portal for 2012, 2013, and through December 2014. Base 70
was chosen to accommodate the reserved characters used
by VAR and VAR_RSD in addition to the alphanumeric
characters of base 62. We noticed small improvements
at a higher base of 90, particularly in the maximum
lengths and in a reduced standard deviation.

† This is the B used in the comparable VAR encoding.

Figure 7: Comparison of compressed polygon lengths with original lengths (O) of polygon strings. Panels (a) and (b) each show compression lengths for one input transformation.

In general, VAR can be characterized as the best and
simplest direct technique, with the best compression
ratio and least variability. BIG is very close and, as
indicated above, is better when ΔΔ < B and, for larger
ΔΔ, when a significant number of deltas require two
characters in VAR. However, VAR_RSD is slightly better
overall. Because the results are so close, which method is
ultimately deemed best depends strongly on the specific
polygon and on the overall distribution of the polygons.

Figures 7a and 7b show that all of the methods yield
substantial compression. As seen from the figures and
from the formulas displayed earlier, the results are
essentially linear in N.

BIG and VAR are the best direct techniques, each
leading in about 50% of the cases. We observed from
the detailed results for each polygon that VAR_RSD is
better than VAR and that, more than 50% of the time,
VAR_RSD is better than BIG. Thus we introduce
the polyalgorithm, POLY, which is slightly better than
VAR_RSD overall.

Figures 8a and 8b compare the best methods using a
set of 70 available characters, with corresponding
encoding base B = 61, 62, 63, 64, or 70 for the different
techniques, but we would get similar results with a larger
base.

Compression of polygons using kd-trees [4] is known
to have good compression ratios only when the polygon
mesh is sparse with few edges [1].

As we explored various techniques, we experimented
with several small optimizations that would occasionally
save a character. These involved adjusting some of the
parameters, such as X_0, Y_0, X_factor, and Y_factor,
changing the piecewise linear representation of ΔΔ(S),
and so forth, but overall the parameters presented in this
paper seemed the best compromise.

By Shannon’s coding theorem [14], for a set Σ of
symbols in which each symbol c has probability of
occurrence p_c, optimal compression requires on average
the entropy, given by:

    E(\Sigma) = -\sum_{c \in \Sigma} p_c \log_2(p_c)        (24)

bits to encode each symbol. Huffman encoding is
an optimal prefix encoding technique: its expected code
length is never shorter than the entropy of the given
distribution and exceeds it by less than one bit per
symbol. In Figure 9, we see that BIG and VAR are close
to the almost optimal Huffman encoding. None of the
other techniques, such as 7zip, LZW, and gzip, were as
good on the NWS corpus as these two.
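
Equation (24) can be evaluated directly on an empirical symbol distribution, as in the minimal sketch below. The sample inputs are illustrative and are not taken from the NWS corpus.

```python
from collections import Counter
from math import log2

def entropy_bits(symbols):
    """Empirical entropy, in bits per symbol, of a sequence of symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())

print(entropy_bits("aabb"))                    # two equally likely symbols
print(entropy_bits([8, 8, 8, 8, 15, 15, 24, 36]))
```

Multiplying the per-symbol entropy by the number of symbols gives a lower bound, in bits, on the expected length of any symbol-by-symbol code for a source with that distribution.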

12. FUTURE WORK 

Future work includes exploring even more use of
statistical skew, such as an extended form of
variable-length encoding. This approach would take the
variable-length encoding strategy further by starting from
a base-2 (binary) representation and using one or two
bits to determine the length of the following field. We
will further explore this more Golomb-like option.

We also plan to use integer programming to find a
near-optimal set of compression parameters, so that we
maximize the leverage of any skew in the delta values.

Figure 8: Comparison of best techniques. Panels (a) and (b) each show compression lengths for one input transformation.

Figure 9: Comparison of compressed polygon lengths with Shannon bound. Panels (a) and (b) each show compression lengths in bits for one input transformation.